\section{Introduction}

%1.how virtual machines work on storage
%2.cow and why they are not good
%3.introduce dedup

Virtualization is the engine behind many popular cloud computing platforms.
Amazon, Aliyun and many others have provided public VM clouds that host 
hundreds of thousand virtual machines(VM) which are dynamically created
by users. In such a virtualized environment,
each instance of guest operating system operates
on virutal disks, which are represented as
files called virtual disk image (e.g. .vhd, .vmdk) in the host operating system.
Because those virtual disk images are stored as files in the external storage,
backing up VM data in virtual disks is mainly done by taking snapshots
of virtual disk images,
which is quite different from traditional file system backup. 

The most widely used snapshot method is copy-on-write (CoW). Upon VM image storage system receives a save snapshot request,
it freezes the state of that image file, then all consequent write request will result in the write region being copied
to a incremental snapshot data file. CoW method can finish a snapshot operation instantly, but it has several disadvantages:
first, CoW may affect the general I/O performance due to defered data coping. 
Second, CoW does not seperate backup data and runtime image data,
which have distinct access requirements: runtime image data is directly used by the running VM, 
thus need high throughput, and is very sensitive to latency,
such data must be served with hig cost hardware, but backup data generally only need fair aggregate throughput, 
is not sensitive to latency, thus can be stored in secondary storage devices.
Finally and most important, VM snapshots contain tremendous amount of data duplication, which is nearly impossible to tackle 
if these two kinds of data are tightly coupled together.
 
Data deduplication technique can eliminate such redundancy,
today many D2D backup systems\cite{emc_avamar}\cite{datadomain_whitepaper}.
uses vaiable-size chunking algorithm to detect duplicates in file data.
Chunking divides a data stream into variable length chunks, it has been used to conserve bandwidth\cite{lbfs01}, 
search and index documents in large repositories\cite{bhag07}, scalable storage systems\cite{hydrastor09}, 
store and transfer large directory trees efficiently and with high reliability\cite{jumbo07}.

To chunk a file, starting from the first byte, its contents as seen through a fixed-sized (overlapping) 
sliding window are examined. At every position of the window, a fingerprint or signature of its 
contents, $f$, is computed using hashing techniques such as Rabin fingerprints\cite{rabin81}. 
When the fingerprint meets a certain criteria, such as $f mod D = r$ where $D$, the divisor, 
and $r$ are predefined values; that position of the window defines the boundary of the chunk. 
This process is repeated until the complete file has been broken down into chunks. Next, 
a cryptographic hash or chunk ID of the chunk is computed using techniques such as MD5 or SHA.
After a file is chunked, the index containing the chunk IDs of backed up chunks 
is queried to determine duplicate chunks. New chunks are written to disk and the 
index is updated with their chunk IDs. A file recipe containing all the information 
required to reconstruct the file is generated. The index also contains some metadata 
about each chunk, such as its size and disk location.
How much deduplication is obtained depends on the inherent content overlaps in the data, 
the average size of chunks and the chunking method\cite{poli04}. In general, smaller chunks yield better deduplication.

Inline deduplication is deduplication where the data is deduplicated before 
it is written to disk as opposed to post-process deduplication where backup data 
is first written to a temporary staging area and then deduplicated offline. 
One advantage of inline deduplication is that extra disk space is not required to 
hold and protect data yet to be backed up. 
However, unless some form of locality or similarity is exploited, inline, chunk-based deduplication, 
when done at a large scale faces what has been termed the disk bottleneck problem: to facilitate fast chunk ID lookup, 
a single index containing the chunk IDs of all the backed up chunks must be maintained. 
As the backed up data grows, the index overflows the amount of RAM available and must be paged to disk. 
Without locality, the index cannot be cached effectively, and it is common for nearly 
every index access to require a random disk access. This disk bottleneck severely limits deduplication throughput.

This problem has been addressed by many previous studies. Zhu\cite{bottleneck08} tackle it 
by using an in-memory Bloom Filter and prefetch groups of chunk IDs that are likely to be 
accessed together with high probability. Lillibridge\cite{sparseindex09} break list of chunks 
into large segments, the chunk IDs in each incoming segment are sampled and the segment is 
deduplicated by comparing with the chunk IDs of only a few carefully selected backed up segments. 
These are segments that share many chunk IDs with the incoming segment with high probability.
Deepavali\cite{extreme_binning09} uses Broder's theorem\cite{resemblance97} to find similar files and group them
into the same physical location (bins) to deduplicate against each other.

In cloud storage, we are solving data deduplication problem in a different context of 
data stream deduplication (the D2D case). Because our snapshot storage servive is co-located with runtim VMs,
we have very limited amount of system resource to perform deduplication. For example,
on our typical 8-core, 48GB memory, 12TB disk server, there lives over 20 VMs who are hungrily eating
system resources: some are running map-reduce jobs, some are busy web servers, some are backend databases.
Any behavior that affects user VMs performance or stability is unacceptable.

